This paper focuses on structured-output learning using deep neural networks for 3D human pose estimation from monocular images. Our network takes an image and 3D pose as inputs and outputs a score value, which is high when the image-pose pair matches and low otherwise. The network structure consists of a convolutional neural network for image feature extraction, followed by two sub-networks for transforming the image features and pose into a joint embedding. The score function is then the dot-product between the image and pose embeddings. The image-pose embedding and score function are jointly trained using a maximum-margin cost function. Our proposed framework can be interpreted as a special form of structured support vector machines where the joint feature space is discriminatively learned using deep neural networks. We test our framework on the Human3.6m dataset and obtain state-of-the-art results compared to other recent methods. Finally, we present visualizations of the image-pose embedding space, demonstrating the network has learned a high-level embedding of body-orientation and pose-configuration.
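The dot-product score and maximum-margin training described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding dimension, margin value, and the function names `score` and `max_margin_loss` are all assumptions for exposition.

```python
import numpy as np

def score(image_emb, pose_emb):
    # Score function: dot product between image and pose embeddings.
    return float(np.dot(image_emb, pose_emb))

def max_margin_loss(image_emb, true_pose_emb, neg_pose_embs, margin=1.0):
    # Hinge-style max-margin cost (illustrative form): the matching pair
    # should outscore every non-matching pose by at least `margin`.
    s_true = score(image_emb, true_pose_emb)
    return max(max(0.0, margin + score(image_emb, y) - s_true)
               for y in neg_pose_embs)

# Toy usage with random embeddings standing in for the learned sub-networks.
rng = np.random.default_rng(0)
img = rng.standard_normal(16)
pos = img + 0.1 * rng.standard_normal(16)   # matching pose embeds nearby
negs = [rng.standard_normal(16) for _ in range(5)]
loss = max_margin_loss(img, pos, negs)
```

In training, the gradient of this cost would flow back through both the image CNN and the pose sub-network, shaping the joint embedding so matching pairs score highest.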